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Summary 

In data sets with many more features than observations, independent screening based on all 
univariate regression models leads to a computationally convenient variable selection method. 
Recent efforts have shown that in the case of generaUzed linear models, independent screening 
may suffice to capture all relevant features with high probabiUty, even in ultra-high dimen- 
sion. It is unclear whether this formal sure screening property is attainable when the response 
is a right-censored survival time. We propose a computationally very efficient independent 
screening method for survival data which can be viewed as the natural survival equivalent of 
correlation screening. We state conditions under which the method admits the sure screening 
property within a general class of single-index hazard rate models with ultra-high dimensional 
features. An iterative variant is also described which combines screening with penalized regres- 
sion in order to handle more complex feature covariance structures. The methods are evaluated 
through simulation studies and through application to a real gene expression data set. 



1 Introduction 

With the increasing proliferation of biomarker studies, there is a need for efficient methods for re- 
lating a survival time response to a large number of features. In typical genetic microarray studies, 
the sample size n is measured in hundreds whereas the number of features p per sample can be in 
excess of millions. Sparse regression techniques such as lasso (Tibshirani 1997[ ) and SCAD (Fan 



and Li 2001 1 have proved useful for dealing with such high-dimensional features but their useful- 
ness diminishes when p becomes extremely large compared to n. The notion of NP-dimensionality 
( Fan and Lv[ 2009| ) has been conceived to describe such ultra-high dimensional settings which 



are formally analyzed in an asymptotic regime where p grows at a non-polynomial rate with n. 
Despite recent progress ( [Bradic et al. 201 1[ ), theoretical knowledge about sparse regression tech- 



niques under NP-dimensionality is still in its infancy. Moreover, NP-dimensionality poses substan- 
tial computational challenges. When for example pairwise interactions among gene expressions 
in a genetic microarray study are of interest, the dimension of the feature space will trouble even 



*Department of Mathematical Sciences, Aalborg University, Denmark. Email: gorst@math.aau.dk 
^Department of Biostatistics, University of Copenhagen, Denmark 



1 



the most efficient algorithms for fitting sparse regression models. A popular ad hoc solution is to 
simply pretend that feature correlations are ignorable and resort to computationally swift univariate 
regression methods; so-called independent screening methods. 

In an important paper, [Fan and Lv (20081 laid the formal foundation for using independent 
screening to distinguish 'relevant' features from 'irrelevant' ones. For the linear regression model 
they showed that, when the design is close to orthogonal, a superset of the true set of nonzero re- 
gression coefficients can be estimated consistently by simple hard-thresholding of feature-response 
correlations. This sure independent screening (SIS) property of correlation screening is a rather 
trivial one, if not for the fact that it holds true in the asymptotic regime of NP-dimensionality. 
Thus, when the feature covariance structure is sufficiently simple, SIS methods can overcome the 
noise accumulation in extremely high dimension. In order to accommodate more complex feature 
Fan and Lv| ( |2008| l and [Fan et al\ ( |2009| l developed heuristic, iterated meth- 



covariance structures 



ods combining independent screening with forward selection techniques. Recently, [Fan and Song 



(2010 1 extended the formal basis for SIS to generalized linear models. 

In biomedical applications, the response of interest is often a right-censored survival time, mak- 



ing the study of screening methods for survival data an important one. Fan et al. (20101 investi- 
gated SIS methods for the Cox proportional hazards model based on ranking features according to 
the univariate partial log-likelihood but gave no formal justification. Tibshirani (2009]) suggested 
soft-thresholding of univariate Cox score statistics with some theoretical justification but under 
strong assumptions. Indeed, independent screening methods for survival data are apt to be difficult 
to justify theoretically due to the presence of censoring which can confound marginal associations 



between the response and the features. Recent work by Zhao and Li (2010 1 contains ideas which in- 
dicate that independent screening based on the Cox model may have the SIS property in the absence 
of censoring. 

In the present paper, we depart from the standard approach of studying SIS as a rather specific 
type of model misspecification in which the univariate versions of a particular regression model 
are used to infer the structure of the joint version of the same particular regression model. Instead, 
we propose a survival variant of independent screening based on a model-free statistic which we 
call the 'Feature Aberration at Survival Times' (FAST) statistic. The FAST statistic is a simple 
linear statistic which aggregates across survival times the aberration of each feature relative to its 
time-varying average. Independent screening based on this statistic can be regarded as a natural 
survival equivalent of correlation screening. We study the SIS property of FAST screening in ultra- 
high dimension for a general class of single-index hazard rate regression models in which the risk 
of an event depends on the features through some linear functional. A key aim has been to derive 
simple and operational sufficient conditions for the SIS property to hold. Accordingly, our main 
result states that the FAST statistic has the SIS property in an ultra-high dimensional setting under 
covariance assumptions as in |Fan et al. ( 2009| ), provided that censoring is essentially random and 
that features satisfy a technical condition which holds when they follow an elliptically contoured 
distribution. Utilizing the fact that the FAST statistic is related to the univariate regression coeffi- 



cients in the semiparametric additive hazards model (Lin and Ying ( 1994 1 ; McKeague and Sasieni 
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( 1994 1), we develop methods for iterated SIS. The techniques are evaluated in a simulation study 
where we also compare with screening methods for the Cox model (Fan et al. 2010[ ). Finally, an 
application to a real genetic microarray data set is presented. 



2 The FAST statistic and its motivation 

Let r be a survival time which is subject to right-censoring by some random variable C. Denote 
by N{t) := \ {T f\C < t f\T < C) the counting process which counts events up to time t, let Y{t) := 
1 (r A C > f ) be the at-risk process, and let Z,^W denote a random vector of explanatory variables 
or features. It is assumed throughout that Z has finite variance and is standardized, i.e. centered and 
with a covariance matrix E with unit diagonal. We observe n independent and identically distributed 
(i.i.d.) replicates of {{Ni,Yi,7ji) : < ? < t} for / = 1, . . . ,« where [0, t] is the observation time 
window. 

Define the 'Feature AbeiTation at Survival Times' (FAST) statistic as follows: 

A:=n-' fj\{Li-l{t)}mi{t); (1) 

where Z is the at-risk-average of the Z,s, 

Components of the FAST statistic define basic measures of the marginal association between each 
feature and survival. In the following, we provide two motivations for using the FAST statistic 
for screening purposes. The first, being model-based, is perhaps the most intuitive - the second 
shows that, even in a model-free setting, the FAST statistic may provide valuable information about 
marginal associations. 



2.1 A model-based interpretation of the FAST statistic 

Assume in this section that the 7]s have hazard functions of the form 

rT„0. 



AXO=^o(0 + Zya ; 7 = 1,2,...,«; 



(2) 



with Ao an unspecified baseline hazard rate and & vector of regression coefficients. This is 

the so-called semiparametric additive hazards model (Lin and Ying ( |1994| |; McKeague and Sasieni 



( 1994 1), henceforth simply the Lin- Ying model. The Lin- Ying model corresponds to assuming for 
QachNj an intensity function of the form Yj{t){'ko{t) +Zjflt°}. From the Doob-Meyer decomposi- 
tion dA^j(0 = AMj{t) + Yj{t){XQ{t) + Zja^jdf with Mj a martingale, it is easily verified that 

f {Z, -Z(0}dA^,(0 = [£{Z, -Z(f)}^2j^-(0d?la" + I{Z-Z(0}dM,-(0, t G [0,t]. (3) 

/=1 i=l i=\ 

This suggests that a" is estimable as the solution to the p>^ p linear system of equations 

d = Da; (4) 
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where 



d:=n-'y f {Zi-l{t)}dNi{t), andD:=«-'y; f Yi{t){Zi-l{t)}®^dt. (5) 



Suppose a solves Q. Standard martingale arguments (Lin and Ying 1994 1 imply root n consis- 
tency of a, 

V^(d-a°) A N(0,D-iBD-i), where B = V /V,--Z(0}®^dA^,(0- (6) 

For now, simply observe that the left-hand side of (|4]) is exactly the FAST statistic; whereas djD^j 
for j = l,2,...,p estimate the regression coefficients in the corresponding p univariate Lin- Ying 
models. Hence we can interpret d as a (scaled) estimator of the univariate regression coefficients in 
a working Lin- Ying model. 

A nice heuristic interpretation of d results from the pointwise signal/error decomposition Q 
which is essentially a reformulated linear regression model X^Xot" + X^e = X^y with 'responses' 
yj := dNj{t) and 'explanatory variables' Xj := {Zy — Z(?)}Fy(f). The FAST statistic is given by 
the time average of ^jX^y} and may accordingly be viewed as a survival equivalent of the usual 
predictor-response correlations. 

2.2 A model-free interpretation of the FAST statistic 

For a feature to be judged (marginally) associated with survival in any reasonable interpretation of 
survival data, one would first require that the feature is correlated with the probability of experienc- 
ing an event - second, that this correlation persists throughout the time window. The FAST statistic 
can be shown to reflect these two requirements when the censoring mechanism is sufficiently sim- 
ple. 

Specifically, assume administrative censoring at time T (so that Ci = T). Set V{t) := Var{F(f |Zi)}^/^ 
where F(f|Zi) := P{Ti < t\Zi) denotes the conditional probability of death before time t. For each 
j, denote by 5j the population version of dj (the in probability limit of dj when n — oo). Then 



E{Y,{t)] 

E{Z,jY,{t)} 



\{Ty <tAT) 

= V(T)Cor{Z,„F(T|Z,))+j[^'cor{Z,j,F((|Z,)}j|^£{dF((|Z,)}. 
We can make the following observations: 

(i) If Cor{Ziy,f (f|Zi)} has constant sign throughout [0,T],then \5j\ > \V{z)Cor{Zij,F{z\Zi)}\. 

(ii) Conversely, if Cor{Ziy,F(f|Zi)} changes sign, so that the the direction of association with 
F{t\Z\) is not persistent throughout [0,t], then this will lead to a smaller value of \5j\ com- 
pared to (i). 

(iii) Lastly, if Cor{Ziy,F(f|Zi)} = then 5j = 0. 
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In other words, the sample version dj estimates a time-averaged summary of the correlation function 
1 1— Cor{Ziy,F(?|Zi)} which takes into account both magnitude and persistent behavior throughout 
[0, t]. This indicates that the FAST statistic is relevant forjudging marginal association of features 



with survival beyond the model-specific setting of Section 2. 1 



3 Independent screening with the FAST statistic 

In this section, we extend the heuristic arguments of the previous section and provide theoretical 
justification for using the FAST statistic to screen for relevant features when the data-generating 
model belongs to a class of single-index hazard rate regression models. 

3.1 The general case of single-index hazard rate models 

In the notation of Section [2| we assume survival times Tj to have hazard rate functions of single- 
index form: 

Xj{t)=X{t,Z]a^), j = l,l,...,n. (7) 

Here A : [0,oo) x M — is a continuous function, Zi, . . . ,Z„ are random vectors in W", (xP G 
W" is a vector of regression coefficients, and Zjot^ defines a risk score. We subscript phy n to 
indicate that the dimension of the feature space can grow with the sample size. Censoring will 
always be assumed at least independent so that Cj is independent of Tj conditionally on Z,j. We 
impose the following assumption on the hazard 'link function' A: 

Assumption 1. The survival function exp{ — /q • )As} is strictly monotonic for each f > 0. 

Requiring the survival function to depend monotonically on Zja'^ is natural in order to enable 
interpretation of the components of as indicative of positive or negative association with survival. 
Note that it suffices that A (f , • ) is strictly monotonic for each f > 0. Assumption[T]holds for a range 
of popular survival regression models. For example, A(f,x) := '^^{t) +x with Aq some baseline 
hazard yields the Lin-Ying model Q; X{t,x) := Ao(f)e'' is a Cox model; and X{t,x) := e^Xo{te^) is 
an accelerated failure time model. 

Denote by 5 the population version of the FAST statistic under the model (|7]l which, by the 
Doob-Meyer decomposition dNi {t) = dMi {t) + Yi {t)X , Z|a°)df with Mi a martingale, takes the 
form 



5=E 



' r{Zy-e{t)}Ut)X{t,Zja')dt]- where e(0 := ^i^Jj^'l^ - (8) 

Our proposed FAST screening procedure is as follows: given some (data-dependent) threshold 

Yn>0, 

(i) . calculate the FAST statistic d from the available data and 

(ii) . declare the 'relevant features' to be the set {1 < j < p„ : \dj\> Yn}- 
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By the arguments in Section [2] this procedure defines a natural survival equivalent of correlation 
screening. Define the following sets of features: 

M^={l<j<A, : \dj\>Yn}, 
M":={l<j<A, : ajVO}, 
M"s:={l<j<p, : dj^O}. 

The problem of establishing the SIS property of FAST screening amounts to determining when 
M" C holds with large probability for large n. This translates into two questions: first, when 
do we have M"g C M^; second, when do we have M" C M'J? The first question is essentially 
model-independent and requires establishing an exponential bound for — 6j\ as « — oo. The 

second question is strongly model-dependent and is answered by manipulating expectations under 
the single-index model (j7]). 

We state the main results here and relegate proofs to the appendix where we also state various 
regularity conditions. The following principal assumptions, however, deserve separate attention: 

Assumption 2. There exists c £ W" such that E{Zi \Zj a'^) = cZ|a°. 

Assumption 3. The censoring time Ci depends on Ti,Zi only through Zij, j ^ M". 

Assumption 4. Zij, j € M" is independent of Zij, j ^ M". 

Assumption |2] is a 'linear regression' property which holds true for Gaussian features and, more 
generally, for features following an elliptically contoured distribution (|Hardin[ |1982[). In view of 



Hall and Li (19931 which states that most low dimensional projections of high dimensional fea- 



tures are close to linear. Assumption |2] may not be unreasonable a priori even for general feature 
distributions when /?„ is large. 

Assumption |3]restricts the censoring mechanism to be partially random in the sense of depend- 
ing only on irrelevant features. As we will discuss in detail below, such rather strong restrictions 
on the censoring distribution seem necessary for obtaining the SIS property; Assumption [3] is both 
general and convenient. 

Assumption]?] is the partial orthogonality condition also used by |Fan and Song ( 2010| ). Under 
this assumption and Assumption [5] it follows from ^ that dj = whenever j ^ M", implying 

C M". Provided that we also have 5j / for j G M" (that is, M" C MJJ^g), there exists a 
threshold ^„ > such that 

min 15/1 > C„ max 15/1 = 0. 

jeM" j^M" 

Consequently, Assumptions [3]|4] are needed to enable consistent model selection via independent 
screening. Although model selection consistency is not essential in order to capture just some 
superset of the relevant features via independent screening, it is pertinent in order to limit the size 
of such a superset. 

The following theorem on FAST screening (FAST-SIS) is our main theoretical result. It states 
that the screening property M" C M[J may hold with large probability even when pn grows expo- 
nentially fast in a certain power of n which depends on the tail behavior of features. The covariance 
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condition in the theorem is analogous to that of Fan and Song (2010) for SIS in generalized linear 

models with Gaussian features. 

Theorem 1. Suppose that Assumptions [T]j3] hold alongside the regularity conditions of the ap- 
pendix and that P{\Z\j\ > s) < loexp{—lis^) for some positive constants lQ,l\,r] and suffi- 
ciently large s. Suppose moreover that for some ci > and K" < 1/2, 

|Cov[Zly,zTa°}]|>Cl«-^ 7GM". (9) 

Then M" C Mg. Suppose in addition that = cxn^'^ for some constant < C2 < ci/2 and that 
logp„ = o{«(i-2»c)r,/()7+2)| jj^gj^ tj^g property holds, P(M" C M^) ^ 1 when n-^oo. 

Observe that with bounded features, we may take T] = oo and handle dimension of order log /?„ = 

We may dispense with Assumption 2 on the feature distribution by revising Q. By Lemma [5] 
in the appendix, taking ej{t) := E{Z\jP{T\ > t\Zi)}/E{P{Ti > t\Zi)}, it holds generally under 
Assumption|3]that 

5j=E{ej{nACiAr)}, j£M'\ 

Accordingly, if we replace ([9]) with the assumption that E\Z\jP{Ti > t\Zi)\ > c\n^^ uniformly in 
t for j G M", the conclusions of Theorem [T] still hold. In other words, we can generally expect 
FAST-SIS to detect features which are 'correlated with the chance of survival', much in line with 
Section [2] While this is valuable structural insight, the covariance assumption (|9]l seems a more 
operational condition. 

Assumption[3]is crucial to the proof of Theorem[T]and to the general idea of translating a model- 
based feature selection problem into a problem of hard-thresholding 8. A weaker assumption is not 
possible in general. For example, suppose that only Assumption|2]holds and that the censoring time 
also follows some single-index model of the form (|7]) with regression coefficients )3". Applying 



Lemma 2.1 of Cheng and Wu (I994i to ([8]), there exists finite constants (^1,(^2 (depending on n) 
such that 

5 = E(Cia° + C2i3°). (10) 

It follows that unrestricted censoring will generally confound the relationship between 8 and Ea°, 
hence a^. The precise impact of such unrestricted censoring seems difficult to discern, although 
([To]) suggests that FAST-SIS may still be able to capture the underlying model (unless ^ia° + 11,2^'^ 
is particularly ill-behaved). We will have more to say about unrestricted censoring in the next 
section. 

Theorem [T] shows that FAST-SIS can consistently capture a superset of the relevant features. A 
priori, this superset can be quite large; indeed, 'perfect' screening would result by simply including 
all features. For FAST-SIS to be useful, it must substantially reduce feature space dimension. Below 
we state a survival analogue of Theorem 5 in Fan and Song ( 2010| ), providing an asymptotic rate on 



the FAST-SIS model size. 
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Theorem 1. Suppose that Assumptions [T]|4] hold alongside the regularity conditions of the ap- 
pendix and that P(|Ziy| > 5') < /oexp(—/i^'') for positive constants /o,^i,T] and sufficiently large 
s. If Yn = CA,n^^^ for some K" < 1/2 and C4 > 0, there exists a positive constant C5 such that 

P[|M^| < 0{n'^K.^{Z)}] > 1 - 0(p„exp{-C5«(i-2'^)''/(''+2)}); 

where Amax (^) denotes the maximal eigenvalue of the covariance matrix E of the feature dis- 
tribution. 

Informally, the theorem states that, under similar assumptions as in Theorem [1] and the partial or- 
thogonality condition (Assumption |4]), if features are not too strongly correlated (as measured by 
the maximal eigenvalue of the covariance matrix) and pn grows sufficiently fast, we can choose a 
threshold 7,, for hard-thresholding such that the false selection rate becomes asymptotically negli- 
gible. 

Our theorems say little about how to actually select the hard-thresholding parameter in prac- 
tice. Following Fan and Lv (20081 and Fan et al. ( 2009| ), we would typically choose such that 



M" I is of order «/ log n. Devising a general data-adaptive way of choosing 7,, is an open problem; 



false-selection-based criteria are briefly mentioned in Section 3.3 



3.2 The special case of the Aalen model 

Additional insight into the impact of censoring on FAST-SIS is possible within the more restrictive 



context of the nonparametric Aalen model with Gaussian features (Aalen (1980 1; Aalen (19891). 
This particular model asserts a hazard rate function for 7] of the form 

Xj{t)=h{t) + 'L]a\t), j = l,2,...,«; (11) 

for some baseline hazard rate function Ao and OtP a vector of continuous regression coefficient 
functions. The Aalen model extends the Lin-Ying model of Section [2] by allowing time- varying 
regression coefficients. Alternatively, it can be viewed as defining an expansion to the first order of 
a general hazard rate function in the class ^ in the sense that 



(12) 

.v=0 



dx 

For Aalen models with Gaussian features, we have the following analogue to Theorem [T] 

Theorem 3. Suppose that Assumptions [T]|2] hold alongside the regularity conditions of the ap- 
pendix. Suppose moreover that the Zi is mean zero Gaussian and that T\ follows a model of 
the form ( [TT] ) with regression coefficients a". Assume that C\ also follows a model of the form 
( [TT] ) conditionally on Zi and that censoring is independent. Let ^^{t) := fQa^{s)ds. If for 
some K" < 1/2 and c\ > 0, we have 

\Coy[Zij,ZjE{A°{TiACiAr)}]\>cin-'', ; G M", (13) 

then the conclusions of Theorem [T] hold with rj =2. 
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In view of ([T2]l, Theorem[3]can be viewed as establishing, within the model class Q, conditions for 
first-order validity of FAST-SIS under a general (independent) censoring mechanism and Gaussian 



features. The expectation term in ( [T3| ) is essentially the 'expected regression coefficients at the exit 
time' which is strongly dependent on censoring through the symmetric dependence on survival and 
censoring time. 

In fact, general independent censoring is a nuisance even in the Lin-Ying model which would 
otherwise seem the 'natural model' in which to use FAST-SIS. Specifically, assuming only inde- 
pendent censoring, suppose that Ti follows a Lin-Ying model with regression coefficients a° con- 
ditionally on Zi and that Ci also follows some Lin-Ying model conditionally on Zi. If 
where the components of Zi are i.i.d. with mean zero and unit variance, there exists a. x p„ 
diagonal matrix C such that 

See Lemma[6]in the appendix. It holds that C has constant diagonal iff features are Gaussian; oth- 
erwise the diagonal is nonconstant and depends nontrivially on the regression coefficients of the 
censoring model. A curious implication is that, under Gaussian features, FAST screening has the 
SIS property for this 'double' Lin-Ying model irrespective of the (independent) censoring mecha- 
nism. Conversely, sufficient conditions for a SIS property to hold here under more general feature 
distributions would require the jth component of E^/^CE^^^a*' to be 'large' whenever 01^ is 'large'; 
hardly a very operational assumption. In other words, even in the simple Lin-Ying model, unre- 
stricted censoring complicates analysis of FAST-SIS considerably. 

3.3 Scaling the FAST statistic 

The FAST statistic is easily generalized to incorporate scaling. Inspection of the results in the 
appendix immediately shows that multiplying the FAST statistic by some strictly positive, deter- 
ministic weight does not alter its asymptotic behavior. Under suitable assumptions, this also holds 
when weights are stochastic. In the notation of Section [2j the following two types of scaling are 
immediately relevant: 

df = djBjl^^ (Z-FAST); (15) 
dY^ = djDjl (Lin-Ying-FAST). (16) 

The Z-FAST statistic corresponds to standardizing d by its estimated standard deviation; screening 
with this statistic is equivalent to the standard approach of ranking features according to univariate 
Wald p-values. Various forms of asymptotic false-positive control can be implemented for Z-FAST, 
courtesy of the central limit theorem. Note that Z-FAST is model-independent in the sense that 
its interpretation (and asymptotic normality) does not depend on a specific model. In contrast, the 
Lin-Ying-FAST statistic is model-specific and corresponds to calculating the univariate regression 
coefficients in the Lin-Ying model, thus leading to an analogue of the idea of 'ranking by absolute 



regression coefficients' of Fan and Song (2010 1 . 
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We may even devise a scaling of d which mimicks the 'ranking by marginal likelihood ratio' 
screening of |Fan and Song (2010 1 by considering univariate versions of the natural loss function 
P 1-^ j5^Dj5 — 2j5^d for the Lin-Ying model. The components of the resulting statistic are rather 
similar to ([T6]l, taking the form 



1/2 



(loss-FAST). 



(17) 



Additional flexibility can be gained by using a time-dependent scaling where some strictly 
positive (stochastic) weight is multiplied on the integrand in ([T]). This is beyond the scope of the 
present paper. 



4 Beyond simple independent screening - iterated FAST screening 

The main assumption underlying any SIS method, including FAST-SIS, is that the design is close 
to orthogonal. This assumption is easily violated: a relevant feature may have a low marginal 
association with survival; an irrelevant feature may be indirectly associated with survival through 
associations with relevant features etc. To address such issues, [Fan and Lv| ( |2008| ) and Fan et al. 
( |2009[ ) proposed various heuristic iterative SIS (ISIS) methods which generally work as follows. 
First, SIS is used to recruit a small subset of features within which an even smaller subset of features 
is selected using a (multivariate) variable selection method such as penalized regression. Second, 
the (univariate) relevance of each feature not selected in the variable selection step is re-evaluated, 
adjusted for all the selected features. Third, a small subset of the most relevant of these new features 
is joined to the set of already selected features, and the variable selection step is repeated. The last 
two steps are iterated until the set of selected features stabilizes or some stopping criterion of choice 
is reached. 

We advocate a similar strategy to extend the application domain of FAST-SIS. In view of Sec- 
tion 2.1 a variable step using a working Lin-Ying model is intuitively sensible. We may also pro- 



vide some formal justification. Firstly, estimation in a Lin-Ying model corresponds to optimizing 
the loss function 

L(i3):=i3^Di5-2j5^d; (18) 



where D was defined in Section 2.1 As discussed by [Martinussen and Scheike (20091, the loss 
function ( fTS] ) is meaningful for general hazard rate models: it is the empirical version of the mean 
squared prediction error for predicting, with a working Lin-Ying model, the part of the intensity 
which is orthogonal to the at-risk indicator. In the present context, we are mainly interested in 
the model selection properties of a working Lin-Ying model. Suppose that T\ conditionally on Zi 
follows a single-index model of the form (|7]) and that Assumptions 3]|4]hold. Suppose that Aj3° = 8 



with A the in probability limit of D. Then aj' = implies j3j' = ( Hattori 2006 1 so that a working 
Lin-Ying model will yield conservative model selection in a quite general setting. Under stronger 
assumptions, the following result, related to work by [Brillinger ( 1983 1 and Li and Duan ( 1989 1, is 
available. 
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Theorem 4. Assume that T\ conditionally on Zi follows a single-index model of the form 
([7]). Suppose moreover that Assumption [2] holds and that C\ is independent of T\,Z,\ (random 
censoring). If defined by Aj3*' = 5 is the vector of regression coefficients of the associated 
working Lin-Ying model and A is nonsingular, then there exists a nonzero constant v depending 
only on the distributions of Xja,^ and C\ such that /S" = VOt*^. 

Thus a working Lin-Ying model can consistently estimate regression coefficient signs under mis- 
specification. From the efforts of Zhu et al. ( 2009[ ) and Zhu and Zhu (2009 1 for other types of 
single-index models, it seems conceivable that variable selection methods designed for the Lin- 
Ying model will enjoy certain consistency properties within the model class (|7]). The conclusion of 
Theorem |4] continues to hold when A is replaced by any matrix proportional to the feature covari- 
ance matrix E. This is a consequence of Assumption [2] and underlines the considerable flexibility 
available when estimating in single-index models. 



Variable selection based on the Lin-Ying loss ( |T8) can be accomplished by optimizing a penal- 
ized loss function of the form 

(19) 



where px : 



)5^L(i5) + £p,(|j8,-|); 

7=1 

is some nonnegative penalty function, singular at the origin to facilitate model 



selection (Fan and Li 2001 1 and depending on some tuning parameter A controlling the sparsity of 
the penalized estimator. A popular choice is the lasso penalty ( Tibshirani 2009[ ) and its adaptive 
variant ( Zou 2006 1, corresponding to penalty functions px{\^j\) = X\^j\ and px{\^j\) = 'X\^j\/\^j\ 



Ma 



^ 

with j3 some root n consistent estimator of j3 , respectively. These penalties were studied by 
and Leng (20071 and Martinussen and Scheike ( 2009| ) for the Lin-Ying model. Empirically, we 
have had better success with the one-step SCAD (OS-SCAD) penalty of |Zou and Li| ( |200^ than 
with lasso penalties. Letting 

{aX - 



wi{x) :=Al(x< + - 



1 



-l(x>A), a>2 



an OS-SCAD penalty function for the Lin-Ying model can be defined as follows: 

pxm)--=^x{mj\m- 



(20) 



(21) 



Here j5 := argmin|5L(j3) is the unpenalized estimator and D := tr(D) is the average diagonal 
element of D; this particular re-scaling is just one way to lessen dependency of the penalization on 
the time scale. If D has approximately constant diagonal (which is often the case for standardized 
features), then re-scaling by D leads to a similar penalty as for OS-SCAD in the linear regression 



model with standardized features. The choice a = 3.7 in (20) was recommended by Fan and Li 



(2001 1. OS-SCAD has not previously been explored for the Lin-Ying model but its favorable per- 



formance in ISIS for other regression models is well known ( [Fan et al\ |2009[ |2010| ). OS-SCAD 
can be implemented efficiently using, for example, coordinate descent methods for fitting the lasso 
( Gorst-Rasmussen and Scheike[ 201 1[ Friedman et al. 20071. For fixed p, the OS-SCAD penalty 
( |2T] ) has the oracle property if the Lin-Ying model holds true. A proof is beyond scope but follows 
by adapting |Zou and LT| ( |2008[ I along the lines of [Martinussen and Scheike| ( |2009[ l. 
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In the basic FAST-ISIS algorithm proposed below, the initial recruitment step corresponds to 
ranking the regression coefficients in the univariate Lin-Ying models. This is a convenient generic 



choice because it enables interpretation of the algorithm as standard 'vanilla ISIS' ( Fan et aL \ 2009 1 
for the Lin-Ying model. 

Algorithm 1 (Lin-Ying-FAST-ISIS). Set M := {1, . . . ,p}, let fmax be some pre-defined maximal 
number of iterations of the algorithm. 

1. (Initial recruitment). Perform SIS by ranking \djD'^^\, j = l,...,p„, according to de- 
creasing order of magnitude and retaining the Icq < d most relevant features Ai C M. 

2. For r = 1,2, . . . do: 

(a) (Feature selection). Define coj := ooif j £ Ar and coj := otherwise. Estimate 

P := argmin JL(i3) + £ (Ojp^Ml)}' 



with px defined in ( |2T] ) for some suitable tuning parameter X. Set := {j : jij / 
0}. 

(b) If r > 1 and B,- = B^-i, or if r = r^ax; return B^. 

(c) (Re-recruitment). Otherwise, re-evaluate relevance of features in M\B,. according 
to the absolute value of their regression coefficient |j3y| in the |M\B,-| unpenalized 
Lin-Ying models including each feature in M\Br and all features in B^, i.e. 

Pj:=Pi^\ where ^^'' = argminpj^.j^^L(i3{^.}uB,), ; G M\B,. (22) 

Take A^+i := U B^ where is the set of the kr most relevant features in M\Ar, 
ranked according to decreasing order of magnitude of 



Fan and Lv (2008 1 recommended choosing d to be of order ?i/log?i. Following Fan et al. (2009 1, 
we may take ko = [2d/3\ and ^/ = J — |A/| at each step. This k[) ensures that we complete at least 
one iteration of the algorithm; the choice of k,. for r > ensures that at most d features are included 
in the final solution. 

Algorithm [T] defines an iterated variant of SIS with the Lin-Ying-FAST statistic ( [T6| ). We can 
devise an analogous iterated variant of Z-FAST-SIS in which the initial recruitment is performed 



by ranking based on the statistic ( 15 1, and the subsequent re-recruitments are performed by ranking 
|Z| -statistics in the multivariate Lin-Ying model according to decreasing order of magnitude, using 
the variance estimator ([6]l. A third option would be to base recruitment on ( 17 1 and re-recruitments 



on the decrease in the multivariate loss ( [18] ) when joining a given feature to the set of features picked 
out in the variable selection step. 

The re-recruitment step (b.iii) in Algorithm [T] resembles that of Fan et al. (20091. Its naive 



implementation will be computationally burdensome when p„ is large, requiring a low-dimensional 
matrix inversion per feature. Significant speedup over the naive implementation is possible via the 
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matrix identity 



D 





where fc = f'-f^D"if. 



(23) 



Note that only the first row of D Ms required for the re-recruitment step so that (|22]) can be imple- 
mented using just a single low-dimensional matrix inversion alongside 0{pn) matrix/vector multi- 



pUcations. Combining ( [23] ) with Q, a similarly efficient implementation applies for Z-FAST-ISIS. 

The variable selection step (b.i) of Algorithm [T] requires the choice of an appropriate tuning 
parameter. This is traditionally a difficult part of penalized regression, particularly when the aim 



is model selection where methods such as cross-validation are prone to overfitting (Leng et al. 



2007 1. Previous work on ISIS used the Bayesian information criterion (BIC) for tuning parameter 



selection (Fan et al. 20091. Although BIC is based on the likelihood, we may still define the 



following 'pseudo BIC based on the loss ( [18] ): 

PBIC(A) = K{L{^^)-Lm + n-'dhlogn. 



(24) 



Here "^^e penalized estimator, p is the unpenalized estimator, K" > is a scaling constant of 
choice, and df;^ estimates the degrees of freedom of the penalized estimator. A computationally 
convenient choice is df;^ = llo (Zou et al. 2007 1. It turns out that choosing X = argmin;LPBIC;L 



may lead to model selection consistency. Specifically, the loss (18) for the Lin-Ying model is of 



the least-squares type. Then we can repeat the arguments of Wang and Leng (2007 ) and show that. 



under suitable consistency assumptions for the penalized estimator, there exists a sequence A„ — )• 
yielding selection consistency for and satisfying 



p| inf PBIC(A) >PBIC(A„)} 



1, 



(25) 



with S the union of the set of tuning parameters A which lead to overfitted (strict supermodels of 
the true model), respectively underfitted models (any model which do not include the true model). 
While ([25]) holds independently of the scaling constant K, the finite-sample behavior of PBIC de- 
pends strongly on K. A sensible value may be inferred heuristically as follows: the range of a 'true' 
likelihood BIC is asymptotically equivalent to a Wald statistic in the sense that (for fixed p). 



BIC(O) - BIC(oo) = jS^LWi^ML +Op(« 



(26) 



with ^jyjL the maximum likelihood estimator and I()3o) ~ « 'Var(^]yjL — Po) ^ the information 
matrix. We may specify K by requiring that PBIC(O) — PBIC(oo) admits an analogous interpretation 
as a Wald statistic. Since PBIC(O) - PBIC(oo) = jcd^D^^d + Op(«"'/2), it follows from ^ that we 
should choose 

K := 7—. 

d^D Id 

This choice of K also removes the dependency of PBIC on the time scale. 
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5 Simulation studies 



In this section, we investigate the performance of FAST screening on simulated data. Rather than 
comparing with popular variable selection methods such as the lasso, we will compare with anal- 



ogous screening methods based on the Cox model (Fan et al. 2010). This seems a more pertinent 
benchmark since previous work has already demonstrated that (iterated) SIS can outperform vari- 
able selection based on penalized regression in a number of cases (Fan and Lv ( 2008| l; [Fan et at. 
( |2009l )). 

For all the simulations, survival times were generated from three different conditionally expo- 
nential models of the generic form (|7]); that is, a time-independent hazard 'link function' applied to 
a hnear functional of features. For suitable constants c, the hnk functions were as follows (see also 
Figure [T}: 



Logit 
Cox 
Log 



{l+exp(ciogitx} 1 



exp(_CcoxXj 



■^logit(^)-^) 
/lcox(f,-^) 

X\og{t,x) := log{e+(ciog;c)^}{l+exp(ciog;c)}"^ 
The link functions represent different characteristic effects on the feature functional, ranging from 
uniformly bounded (logit) over fast decay/increase (Cox), to fast decay/slow increase (log). We 
took ciogit = 1.39, Ccox = 0.68, and ciog = 1.39 and, unless otherwise stated, survival times were 
right-censored by independent exponential random variables with rate parameters 0.12 (logit link), 
0.3 (Cox link) and 0.17 (log link). These constants were selected to provide a crude 'calibration' 
to make the simulation models more comparable: for a univariate standard Gaussian feature Z\, a 
regression coefficient j8 = 1, and a sample size oin = 300, the expected |Z| -statistic was 8 for all 
three link functions with an expected censoring rate of 25%, as evaluated by numerical integration 
based on the true likelihood. 



Methods for FAST screening have been implemented in the R package 'ahaz' (Gorst-Rasmussen 



20111. 




Figure 1: The three hazard rate link functions used in the simulation studies 
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Table 1: MMMS and RSD (in parentheses) for basic SIS with n = 300 and p ^ 20, 000 (100 simulations). 



.5 = 3 j = 6 s = 9 5 = 3 s - 



s = 9 .5 = 3 j = 6 s = 9 






d 


3(1) 


32 (53) 


530 (914) 


3(0) 


7(5) 


45 (103) 


3(0) 


22 (44) 


202 (302) 






4(1) 


66 (95) 


678 (939) 


3(0) 


11(14) 


96 (176) 


3(1) 


41 (87) 


389 (466) 






3(1) 


40 (71) 


522 (873) 


3(0) 


7(7) 


48 (105) 


3(0) 


22 (45) 


262 (318) 




Cox 


3(1) 


44 (68) 


572 (928) 


3(0) 


7(4) 


40(117) 


3(0) 


26 (51) 


280 (306) 


0.25 


d 


3(0) 


6(1) 


11 (1) 


3(0) 


6(0) 


9(1) 


3(0) 


6(1) 


10(1) 




d^Y 


3(0) 


7(1) 


11 (2) 


3(0) 


6(1) 


10(1) 


3(0) 


7(1) 


11 (1) 




d^ 


3(0) 


6(1) 


11 (1) 


3(0) 


6(0) 


10(1) 


3(0) 


6(1) 


10(1) 




Cox 


3(0) 


6(1) 


11 (1) 


3(0) 


6(0) 


9(1) 


3(0) 


6(1) 


10(1) 


0.5 


d 


3(0) 


7(2) 


12(2) 


3(0) 


6(1) 


10(1) 


3(0) 


7(1) 


11 (2) 




d^Y 


3(0) 


9(3) 


13(1) 


3(0) 


8(2) 


13(2) 


3(0) 


8(2) 


12(2) 




d^ 


3(0) 


8(3) 


12(1) 


3(0) 


7(2) 


12(2) 


3(0) 


7(2) 


12(2) 




Cox 


3(1) 


9(3) 


13(2) 


3(0) 


6(1) 


11 (2) 


3(0) 


8(2) 


12(2) 


0.75 


d 


3(1) 


9(2) 


13(1) 


3(0) 


8(2) 


12(1) 


3(1) 


9(3) 


12(2) 




^LY 


4(2) 


11(3) 


14 (2) 


4(1) 


11 (3) 


14(1) 


4(2) 


10(2) 


13(1) 




d^ 


4(1) 


10(2) 


13(1) 


3(1) 


10(3) 


13(1) 


3(1) 


9(2) 


13(1) 




Cox 


5(3) 


12(2) 


14(1) 


3(0) 


7(2) 


12(2) 


4(1) 


11 (3) 


14 (2) 



5.1 Performance of FAST-SIS 

We first considered tlie performance of basic, non-iterated FAST-SIS. Features were generated as 



in scenario 1 of Fan and Song (2010). Specifically, let e be standard Gaussian. Define 



Zi;:=^^^, 7 = l,...,p; (27) 

where £y is independently distributed as a standard Gaussian for 7 = 1,2, ... , [;?/3j : independently 
distributed according to a double exponential distribution with location parameter zero and scale 
parameter 1 for 7 = [/?/3j + 1, . . . , [2/7/3J ; and independently distributed according to a Gaussian 
mixture 0.5A'^(— 1, 1) +0.5/^^^(1,0.5) for j = \lp/'i\ + 1, . . . ,/?. The constants aj satisfy £?! = ••• = 
a\s and ay = for j > 15. With the choice ai = y^p/(l — p), < p < 1, we obtain Cor(Zi;,Ziy) = p 
for / 7^ j, i,j < 15, enabling crude adjustment of the correlation structure of the feature distribution. 
Regression coefficients were chosen to be of the generic form a" = (1,1. 3, 1,1. 3,...)^ with exactly 
the first s components nonzero. 

For each combination of hazard link function, non-sparsity level s, and correlation p , we per- 
formed 100 simulations with p = 20,000 features and n = 300 observations. Features were ranked 



using the vanilla FAST statistic, the scaled FAST statistics ( 15 1 and ( 16 1, and SIS based on a Cox 
working model (Cox-SIS), the latter ranking features according their absolute univariate regression 
coefficient. Results are shown in Table [T] As a performance measure, we report the median of the 
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0.25 0.5 
P 



0.75 



0.25 0.5 
P 



0.75 



0.25 0.5 
P 



0.75 



Figure 2: Minimum observed |Z|-statistics in the oracle model under Aiog, for the SIS simulation study. 



minimum model size (MMS) needed to detect all relevant features alongside its relative standard 
deviation (RSD), the interquartile range divided by 1.34. The MMS is a useful performance mea- 
sure for this type of study since it eliminates the need to select a threshold parameter for SIS. The 
censoring rate in the simulations was typically 30%-40%. 

For all methods, the MMMS is seen to increase with feature correlation p and non-sparsity s. 
As also noted by Fan and Song ( |2010 ) for the case of SIS for generalized linear models, some 
correlation among features can actually be helpful since it increases the strength of marginal signals. 
Overall, the statistic d^^ seems to perform slightly worse than both d and d^ whereas the latter two 
statistics perform similarly to Cox-SIS. In our basic implementation, screening with any of the 
FAST statistics was more than 100 times faster than Cox-SIS, providing a rough indication of the 
relative computational efficiency of FAST-SIS. 

To gauge the relative difficulty of the different simulation scenarios. Figure [2] shows box plots 
of the minimum of the observed |Z| -statistics in the oracle model (the joint model with only the 
relevant features included and estimation based on the likehhood under the true link function) for 
the link function Aiog. This particular link function represents an 'intermediate' level of difficulty; 
with |Z| -statistics for Acox generally being somewhat larger and |Z| -statistics for Aiogit being shghtly 
smaller. Even with oracle information and the correct working model, these are evidently difficult 
data to deal with. 



5.2 FAST-SIS with non-Gaussian features and nonrandom censoring 

We next investigated FAST-SIS with non-Gaussian features and a more complex censoring mecha- 
nism. The simulation scenario was inspired by the previous section but with all features generated 
according to either a standard Gaussian distribution, a f -distribution with 4 degrees of freedom, or 
a unit rate exponential distribution. Features were standardized to have mean zero and variance 
one, and the feature correlation structure was such that Cor(Zi,-,Ziy) = 0.125 for ij < 15, / 7^ j 
and Cor(Zi,-,Ziy) = otherwise. Survival times were generated according to the link function Aiog 
with regression coefficients )3 = (1, 1.3, 1, 1.3, 1, 1.3,0,0, . . .) while censoring times were generated 
according to the same model (link function Aiog and conditionally on the same feature realizations) 
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Table 2: MMMS and RSD (in parentheses) for SIS under nongaussian features/nonrandom censoring with 
n = 300 and p = 20,000 (100 simulations). 











k 






Feature distr. 




K — U 


—U.J 


— u.zj 


U.Zj 


KJ.J 


Gaussian 


a 


6 (1) 


(8) 


7(4) 


6(1) 


1 (3) 




jLY 

a 


(1) 


8 (6) 


7(3) 


7(2) 


8 (5) 






6 (li 


7 f6) 


7(2) 


6(1) 


7 (2) 




jjloss 


6(1) 


8(6) 


7(3) 


6(1) 


7(3) 




Cox 


6(1) 


8(5) 


7(2) 


6(1) 


7(2) 


t {dj = 4) 


a 


6(1) 


1-3 (17) 


7(5) 


6(1) 


1 (3) 




jLY 

a 


I 1 /"7\ 

II (7) 


12 (S) 


9 (7) 48 (136) 


(loj) 






7(3) 


17 (20) 


8(5) 


7(2) 


7(3) 




jjloss 


6(1) 


8(7) 


7(4) 


8(15) 


10(10) 




Cox 


7(4) 


15 (23) 


8(10) 


8(4) 


9(5) 


Exponential 


d 


6(1) 


6(2) 


6(1) 


7(4) 


8(7) 




jLY 


6(1) 


11 (12) 


7(3) 


6(1) 


6(1) 




d^ 


15 (10) 


34 (36) 


24(17) 


22 (28) 


26 (29) 




jjloss 


6(0) 


7(4) 


6(1) 


6(1) 


6(1) 




Cox 


8(4) 


22 (31) 


14(11) 


9(6) 


9(8) 



with regression coefficients = kp. The constant k controls the association between censoring and 
survival times, leading to a basic example of nonrandom censoring (competing risks). 

Using p = 20,000 features and n = 300 observations, we performed 100 simulations under each 
of the three feature distributions, for different values of k. Table |2]reports the MMMS and RSD for 



the four different screening methods of the previous section, as well as for the statistic d'"***^ in ( 17 1. 
The censoring rate in all scenarios was around 50%. 

From the column with ^ = (random censoring), the heavier tails of the f-distribution increases 
the MMMS, particularly for d^^. The vanilla FAST statistic d seems the least affected here, most 
likely because it does not directly involve second-order statistics which are poorly estimated due to 
the heavier tails. While d^ and d^"**** are also scaled by second-order statistics, the impact of the tails 
is dampened by the square-root transformation in the scaling factors. In contrast, the more distinctly 
non-Gaussian exponential distribution is problematic for d^. Overall, the statistics d and d'"**** seems 
to have the best and most consistent performance across feature distributions. Nonrandom censor- 
ing generally increases the MMMS and RSD, particularly for the non-Gaussian distributions. There 
appears to be no clear difference between the effect of positive and negative values of k. We found 
that the effect of / diminished when the sample size was increased (results not shown), suggest- 
ing that nonrandom censoring in the present example leads to a power rather than bias issue. This 



may not be surprising in view of the considerations below ( |T4| ). However, the example still shows 
the dramatic impact of nonrandom censoring on the performance of SIS. 
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5.3 Performance of FAST-ISIS 

We lastly evaluated the ability of FAST-ISIS (Algorithm [T]) to cope with scenarios where FAST- 
SIS fails. As in the previous sections, we compare our results with the analogous ISIS screening 
method for the Cox model. To perform Cox-ISIS, we used the R package 'SIS', with (re)recruitment 
based on the absolute Cox regression coefficients and variable selection based on OS-SCAD. We 
also compared with Z-FAST-ISIS variant described below Algorithm [T] in which (re)recruitment is 
based on the Lin-Ying model |Z| -statistics (results for FAST-ISIS with (re)recruitment based on the 
loss function were very similar). 



For the simulations, we adopted the structural form of the feature distributions used by Fan et al. 



(2010 1. We considered n = 300 observations and p = 500 features which were jointly Gaussian 
and marginally standard Gaussian. Only regression coefficients and feature correlations differed 
between cases as follows: 

1. The regression coefficients are jSi = -0.96, ^ = 0.90, jSs = 1.20, = 0.96, jSj = -0.85, 
jSe = 1.08 and j3y = for j > 6. Features are independent, Cor(Zi,,Ziy) = for / ^ j. 

2. The regression coefficients are the same as in (a) while Corr(Zi,',Ziy) = 0.5 for / ^ j. 

3. The regression coefficients are = p2 = p3 = 4/3, p4 = —2^/2. The correlation between 
features is given by Cor(Zi.4,Zij) = 1 /\/2 for j 7^ 4 and Cor(Zi,-,Ziy) = 0.5 for / 7^ j, i,j ^ 4. 

4. The regression coefficients are pi = = p3 = 4/3, ^4 = —2^/2 and jSs = 2/3. The corre- 
lation between features is Cor(Zi,4,Ziy) = 1/ \/2 for j ^ {4,5}, Cor(Zi 5,Ziy) = for j / 5, 
and Cor(Zi;,Ziy) = 0.5 for / / j, i,j ^ {4,5}. 

Case (a) serves as a basic benchmark whereas case (b) is harder because of the correlation between 
relevant and irrelevant features. Case (c) introduces a strongly relevant feature Z4 which is not 
marginally associated with survival; lastly, case (d) is similar to case (c) but also includes a feature 
Z5 which is weakly associated with survival and does not 'borrow' strength from its correlation with 
other relevant features. 



Following Fan a/. (20101, we took (i = [?i/log?i/3j = 17 for the initial dimension reduction; 
performance did not depend much on the detailed choice of d of order ?i/log?i. For the three 
different screening methods, ISIS was run for maximum of 5 iterations. (P)BIC was used for tuning 
the variable selection steps. Results are shown in Table [3] summarized over 100 simulations. We 
report the average number of truly relevant features selected by ISIS and the average final model 
size, alongside standard deviations in parentheses. To provide an idea of the improvement over 
basic SIS, we also report the median of the minimum model size (MMMS) for the initial SIS step 
(based on vanilla FAST-SIS only). The censoring rate in the different scenarios was 25%-35%. 

The overall performance of the three ISIS methods is comparable between the different cases. 
All methods deliver a dramatic improvement over non-iterated SIS, but no one method performs 
significantly better than the others. The two FAST-ISIS methods have a surprisingly similar per- 
formance. As one would expect, Cox-ISIS does particularly well under the link function Acox but 
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Table 3: Simulation results for ISIS with n = 300, p = 500 and d ~ 17 (100 simulations). Numbers in 
parentheses are standard deviations (or relative standard deviation, for the MMMS). 







IVllVllVlij [KOLJJ 


Average no. 


true positives (ISIS) 


Average model size (ISIS) 


T V PA QT 


7 FAST 




T Y FAST 


7 FAST 




'4ogit 


(a) 




6.0 (0) 


6.0 (0) 


5 5 (1) 


7 8 (1) 


7 9 (2) 


6 3 (2) 




(b) 


500(1) 


5.5 (1) 


5 5 (1) 


3 4 (1) 


7 (2) 


6 7 (2) 


4 3 (2) 




(c) 


240 (125) 


3.7(1) 


3.8(1) 


3.0 (2) 


5.2 (2) 


5.7 (3) 


4.5 (4) 




(d) 


230 (124) 


4.8(1) 


4.7(1) 


3.5 (2) 


5.9 (2) 


6.2 (3) 


4.9 (4) 


^ox 


(a) 


7(1) 


6.0 (0) 


6.0 (0) 


6.0 (0) 


7.5(1) 


7.5(1) 


6.2(1) 




(b) 


500 (1) 


5.8(1) 


5.8(1) 


5.6(1) 


7.2 (2) 


6.8(1) 


6.4 (2) 




(c) 


218(120) 


3.7(1) 


3.6(1) 


3.0 (2) 


5.1 (3) 


5.3 (3) 


4.9 (4) 




(d) 


258 (129) 


4.9(1) 


4.8(1) 


3.8 (2) 


6.3 (2) 


6.0 (2) 


6.4 (5) 




(a) 


6(1) 


6.0 (0) 


6.0 (0) 


6.0 (0) 


7.3(1) 


7.4(1) 


6.3(1) 




(b) 


500(1) 


5.8(1) 


5.7(1) 


4.9(1) 


7.2 (2) 


6.7(1) 


5.7 (2) 




(c) 


252 (150) 


3.9 (0) 


3.9(1) 


3.4(1) 


5.3 (2) 


4.9 (2) 


5.5 (5) 




(d) 


223 (132) 


4.9(1) 


4.8(1) 


4.0 (2) 


6.0 (2) 


6.1 (2) 


5.9 (5) 



does not appear to be uniformly better than the two FAST-ISIS methods even in this ideal setting. 
Under the link function Aiogit, both FAST-ISIS methods outperform Cox-ISIS in terms of the num- 
ber of true positives identified, as do they for the link function Aiog, although less convincingly. On 
the other hand, the two FAST-ISIS methods generally select slightly larger models than Cox-ISIS 
and their false-positive rates (not shown) are correspondingly slightly larger. FAST-ISIS was 40-50 
times faster than Cox-ISIS, typically completing calculations in 0.5-1 seconds in our specific im- 
plementation. Figure |3] shows box plots of the minimum of the observed |Z| -statistics in the oracle 
model (based on the hkelihood undebr the true model). 




Case (b) 





Case (d) 




^logit A.COX ^lo 



^logit A.COX ^log 



^logit A.COX ^lo 



^logit A.CC 



Figure 3: Minimum observed |Z| -statistics in the oracle models for the FAST-ISIS simulation study. 
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We have experimented with other link functions and feature distributions than those described 
above (results not shown). Generally, we found that Cox-ISIS performs worse than FAST-ISIS 
for bounded link functions. The observation from Table [3j that FAST-ISIS may improve upon 
Cox-ISIS even under the link function Acox, does not necessarily hold when the signal strength is 
increased. Then Cox-ISIS will be superior, as expected. Changing the feature distribution to one 
for which the linear regression property (Assumption [2]l does not hold leads to a decrease in the 
overall performance for all three ISIS methods. 



6 Application to AML data 



The study by Metzeler et al. ( 2008 1 concerns the development and evaluation of a prognostic gene 
expression marker for overall survival among patients diagnosed with cytogenetically normal acute 
myeloid leukemia (CN-AML). A total of 44,754 gene expressions were recorded among 163 adult 
patients using Affymetrix HG-U133 AlB microarrays. Based the method of supervised principal 



components (Bair and Tibshirani 2004 1, the gene expressions were used to develop an 86-gene sig- 
nature for predicting survival. The signature was validated on an external test data set consisting of 
79 patients profiled using Affymetrix HG-U133 Plus 2.0 microarrays. All data is publicly available 
on the Gene Expression Omnibus web site ( [http://www.ncbi.nlm.nih.gov/geo/l l under the accession 
number GSE12417. The CN-AML data was recently used by Benner et al. pOlO l for comparing 
the performance of variable selection methods. 

Median survival time was 9.7 months in the training data (censoring rate 37%) and 17.7 months 
in the test data (censoring rate 41%). Preliminary to analysis, we followed the scaling approach 
employed by Metzeler et al. [ ( [2008] ) and centered the gene expressions separately within the test and 
training data set, followed by a scaling of the training data with respect to the test data. 

We first applied vanilla FAST-SIS to the « = 163 patients in the training data to reduce the di- 
mension from p = 44,754 to <i = [«/log(n)J = 31. We then used OS-SCAD to select a final set 
among these 31 genes. Since the PBIC criterion can be somewhat conservative in practice, we se- 



lected the OS-SCAD tuning parameter using 5-fold cross-validation based on the loss function ( 18 1. 
Specifically, using a random split of {1, ... , 163} into folds Fi,...,F=, of approximately equal size, 
we chose X as: 

A=argmin;ti:z.(^'H^-f,(^)}; 



1=1 



with L(^^ the loss function using only observations from Fi and (A) the regression coefficients 
estimated for a tuning parameter A, omitting observations from Fi. This approach yielded a set of 
7 genes, 5 of which also appeared in the signature of Metzeler et al. (2008 1. For ^ the estimated 



penalized regression coefficients, we calculated a risk score Zj j3 for each patient in the test data. 
In a Cox model, the standardized risk score had a hazard ratio of \.69 {p = 6 ■ 10^'*; Wald test). 



In comparison, lasso based on the Lin-Ying model (Leng et al. (2007 1 ; Martinussen and Scheike 



(2009 1) with 5-fold cross-validation gave a standardized risk score with a hazard ratio of 1.56 {p 



0.003; Wald test) in the test data, requiring 5 genes; Metzeler et al. (2008 1 reported a hazard ratio 



20 



Table 4: Prediction performance of FAST-SIS and Lin-Ying lasso in the AML data, evaluated in terms of the 
Cox hazard ratio for the standardized continuous risk score. The LOO calculations are based on the training 
data only. 



Scenario 


Summary statistic 




Screening method 




d 


dLY 


dl^l d'°^'* 


Lasso 


Test data 


Hazard ratio 


1.69 


1.59 


1.46 1.58 


1.54 




p-value 


6- 10-4 


0.0007 


0.01 0.002 


0.004 




No. predictors 


7 


1 


3 7 


5 


LOO 


p-value 


4- lO-'' 


0.16 


5-10-5 4.10-4 


4- 10-^ 




Median no. predictors 


7 





3 5 


5 



Table 5: Overlap between gene sets selected by the different screening methods and the signature of Metzeler 



et al. 



(20081. 





d 


^LY 


a\z\ 


jloss 


Lasso 


Metzeler 


d 


7 





1 


2 


4 


5 






1 














dlZ| 






3 


2 


2 


2 










7 


2 


5 


Lasso 










5 


5 


Metzeler 












86 



of 1.85 (p = 0.002) for their 86-gene signature. 



We repeated the above calculations for the three scaled versions of the FAST statistic ( 15 1-( 17 1. 
Since assessment of prediction performance using only a single data set may be misleading, we 
also validated the screening methods via leave-one-out (LOO) cross-validation based on the 163 
patients in the training data. For each patient j, we used FAST-SIS as above (or Lin-Ying lasso) 
to obtain regression coefficients ^ j based on the remaining 162 patients and defined the jth LOO 
risk score as the percentile of Zj p j among {Z, j3_y},-^y. We calculated Wald /^-values in a Cox 
regression model including the LOO score as a continuous predictor. Results are shown in Table |4] 
while Table [5] shows the overlap between gene sets selected in the training data. There is seen to be 
some overlap between the different methods, particularly between vanilla FAST-SIS and the lasso, 
and many of the selected genes also appear in the signature of Metzeler et al. ( |2008 1. In the test 
data, the prediction performance of the different screening methods was comparable whereas the 
lasso had a slight edge in the LOO calculations. Lin-Ying SIS selected only a single gene in the 
test data and typically selected no genes in the LOO calculations. We found FAST screening to be 
slightly more sensitive to the cross-validation procedure than the lasso. 

We next evaluated the extent to which iterated FAST-SIS might improve upon the above results. 
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From our limited experience with applying ISIS to real data, instability can become an issue when 
several iterations of ISIS are run; particularly when cross-validation is involved. Accordingly, we 
ran only a single iteration of ISIS using Z-FAST-ISIS. The algorithm kept 2 of the genes from 
the first FAST-SIS round and selected 3 additional genes so that the total number of genes was 5. 
Calculating in the test data a standardized risk score based on the final regression coefficients, we 
obtained a Cox hazard ratio of only 1.06 {p = 0.6; Wald test) which is no improvement over non- 
iterated FAST-SIS. A similar conclusion was reached for the corresponding LOO calculations in 
the training data which gave a Cox Wald p-value of 0.001 for the LOO risk score, using a median of 
4 genes. None of the other FAST-ISIS methods lead to improved prediction performance compared 
to their non-iterated counterparts. 

FAST-ISIS runs swiftly on this large data set: one iteration of the algorithm (re-recruitment 
and OSS-SCAD feature selection with 5-fold cross-validation) completes in under 5 seconds on a 
standard laptop. 

Altogether, the example shows that FAST-SIS can compete with a computationally more de- 
manding full-scale variable selection method in the sense of providing similarly sparse models with 
competitive prediction properties. FAST-ISIS, while computationally very feasible, did not seem to 
improve prediction performance over simple independent screening in this particular data set. 



7 Discussion 

Independent screening - the general idea of looking at the effect of one feature at a time - is a well- 
established method for dimensionality reduction. It constitutes a simple and excellently scalable 
approach to analyzing high-dimensional data. The SIS property introduced by [Fan and Lv| ( [2008] ) 



has enabled a basic formal assessment of the reasonableness of general independent screening meth- 



ods. Although the practical relevance of the SIS property has been subject to scepticism (Roberts 



2008| ), the formal context needed to develop the SIS property is clearly useful for identifying the 



many implicit assumptions made when applying univariate screening methods to multivariate data. 

We have introduced a SIS method for survival data based on the notably simple FAST statistic. 
In simulation studies, FAST-SIS performed on par with SIS based on the popular Cox model, while 
being considerably more amenable to analysis. We have shown that FAST-SIS may admit the 
formal SIS property within a class of single-index hazard rate models. In addition to assumptions 
on the feature distribution which are well known in the literature, a principal assumption for the SIS 
property to hold is that censoring times do not depend on the relevant features nor survival. While 
such partially random censoring may be appropriate to assume in many clinical settings, it indicates 
that additional caution is called for when applying univariate screening and competing risks are 
suspected. 

A formal consistency property such as the SIS property is but one aspect of a statistical method 
and does not make FAST-SIS universally preferable. Not only is the SIS property unlikely to be 
unique to FAST screening, but different screening methods often highlight different aspects of data 



(Ma and Song 2011 1, making it impossible and undesirable to recommend one generic method. 
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We do, however, consider FAST-SIS a good generic choice of initial screening method for general 
survival data. Ultimately, the initial choice of a statistical method is likely to be made on the basis 
of parsimony, computational speed, and ease of implementation. The FAST statistic is about as 
difficult to evaluate as a collection of correlation coefficients while iterative FAST-SIS only requires 
solving one linear system of equations. This yields substantial computational savings over methods 
not sharing the advantage of linearity of estimating equations. 

Iterated SIS has so far been studied to a very limited extent in an empirical context. The iterated 
approach works well on simulated data, but it is not obvious whether this necessarily translates into 
good performance on real data. In our example involving a large gene expression data set, ISIS 
did not improve results in terms of prediction accuracy. Several issues may affect the performance 
of ISIS on real data. First, it is our experience that the 'Rashomon effect' , the multitude of well- 



fitting models (Breiman 2001 1, can easily lead to stability issues for this type of forward selection. 
Second, it is often difficult to choose a good tuning parameter for the variable selection part of ISIS. 
Using BIC may lead to overly conservative results, whereas cross-validation may lead to overfitting 



when only the variable selection step - and not the recruitment steps - are cross-validated. He 



and Lin ( 2011[ l recently discussed how to combine ISIS with stability selection ( Meinshausen and 



Buhlmann 20101 in order to tackle instability issues and to provide a more informative output than 



the concise 'list of indices' obtained from standard ISIS. Their proposed scheme requires running 
many subsampling iterations of ISIS, a purpose for which FAST-ISIS will be ideal because of its 
computational efficiency. The idea of incorporating stability considerations is also attractive from a 
foundational point of view, being a pragmatic departure from the limiting de facto assumption that 
there is a single, true model. Investigation of such computationally intensive frameworks, alongside 
a study of the behavior of ISIS on a range of different real data sets, is a pertinent future research 
topic. 

A number of other extensions of our work may be of interest. We have focused on the important 
case of time-fixed features and right-censored survival times but the FAST statistic can also be used 
with time-varying features alongside other censoring and truncation mechanism supported by the 
counting process formalism. Theoretical analysis of such extensions is a relevant future research 



topic, as is analysis of more flexible, time-dependent scaling strategies for the FAST statistic. Fan 



et al. (2011 1 recently discussed SIS where features enter in nonparametric, smooth manner, and an 



extension of their framework to FAST-SIS appears both theoretically and computationally feasible. 
Lastly, the FAST statistic is closely related to the univariate regression coefficients in the Lin-Ying 
model which is rather forgiving towards misspecification: under feature independence, the uni- 
variate estimator is consistent whenever the particular feature under investigation enters the hazard 



rate model as a linear function of regression coefficients (Hattori 2006 1. The Cox model does not 



have a similar property (Struthers and Kalbfleisch 1986 1. Whether such internal consistency under 
misspecification or lack hereof affects screening in a general setting is an open question. 
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Appendix: proofs 

In addition to Assumptions 1-4 stated in the main text, we will make use of the following assump- 
tions for the quantities defining the class of single-index hazard rate models (|7]l: 

A. E{Zij) = and E{Zjj) = I, j = \,2,---,Pn. 

B. P{Fi(t) = 1} >0. 

C. Var(Zj'^flt'') is uniformly bounded above. 

The details in Assumption A are included mainly for convenience; it suffices to assume that E{Zjj)< 

Our first lemma is a basic symmetrization result, included for completeness. 

Lemma 1. Let X be a random variable with mean ^ and finite variance o^. For t > ^f%o, it 
holds that P(|X-^| > t) <4P(|X| >?/4). 

Proof. First note that when t > \/8(T we have P{\X — n.\ > t /2) < 1/2, by Chebyshev's inequality. 
Let X' be an independent copy of X. Then 

2P{\X\ >t/A)>P{\X' -X\ >t/2.) >P{\X-ii\ >tA\X' -ix\ <t/2). (28) 

But 

P{\X-ix\ >tA\X' -ix\ <t/2)=P{\X-ix\ >t)P{\X' -ix\ <t/2) > ^P{\X-n\ >t). 



Combining this with (28 1, the statement of the lemma follows. □ 



The next lemma provides a universal exponential bound for the FAST statistic and is of independent 
interest. It bears some similarity to exponential bounds reported by [Bradic et al. \ ( [2010 1 for the 
Cox model. 

Lemma 2. Under assumptions A-B there exists constants Ci,C2,C3 > independent of n such 
that for any K>0 and I < j < Pn, it holds that 

P{n^l'^\dj-dj\ >Ci(l+0} < 10exp{-fV(2^^)}+C2exp(-C3?i)+?iP(|Zi^| >K). 

Proof. Fix j throughout. Assume first that |Z,y| < K for some finite K. Define the random variables 

An:=n-'j\f{Zij-ej{t)}mi{t), B, := f {Zj{t) - ej{t)}dN{t); 
f-'iJo Jo 

where A^(f) := ?i"^{A^i(f) H 'rNnit)} and ej{t) =E{Zj{t)}. Then we can write 

n'^^idj - dj) = n^l^{An -£(A„)} + 
We will deal with each term in the display separately. Since dNi{t) < 1, it holds that 

|A„| < max \Zij\ + \\ej\\oo < 2K. 

l<i<n 
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and Hoeffding's inequality ( [Hoeffdin^ 1963[ ) implies 

P{n'l^\An-E{A„)\ >t)< 2txp{-t^/{lK^)}. 



(29) 



Obtaining an analogous bound for n^/^{Bn — E{Bn)} requires a more detailed analysis. Since 

dN{t) < 1, 

(30) 



|S„|< / \Zj{t)-ej{t)\dN{t)<\\Zj-ej\\^. 
Jo 

We will obtain an exponential bound for the right-hand side via empirical process methods. Define 

■.= n-^lJ'i=iZ'tjyi{t) andf'W(0 :=£{£W(0} for/t = 0,l. Denote m := inf,e[o,T] ^'^°H0 and 
observe that m > 0, by Assumption B. Moreover, by Cauchy-Schwartz's inequality, 



Define n„ := {inffg[o,T]^'°nO ^ m/2} and let lf2„ be the indicator of this event. In view of the 
preceding display, we can write 

\m - ej{t)\ la. < ^ { I I (0 - E^"^ {t)\ + \E^'\t)-e^'\t)\}\a^ (31) 



< 2m-"(||P„ -P||fo + \\Pn -^||Fjln„ 



(32) 



with function classes F^:={fi— )-Z'^l(r>fAC>?)}. We proceed to establish exponential bounds 
for the empirical process suprema in (|32]). Each of the F^^s are Vapnik-Cervonenkis subgraph 



classes, and from Pollard (19891 there exists some finite constant C, depending only on intrinsic 



properties of the Fy^s such that 

E{\\Pn-P\\l)<i;n~'E{Zl)=n-'i;. 
In particular, it also holds that E{\\Pn — P||fJ < n^'^^i^'/^. Moreover, 

\Z\j\{Ti >t^Cl >t)-Z\jl{Ti >sACi >s)\^<K^\ s,t£ [0,t]. 



(33) 



Massart 



( |2000| l implies 

2 //ni^2\ 



With ^1 := (^^/^, the concentration theorem of ] 

PW^^\\Pn-P\\i:,>h{\+t)}<exp{-t'/{2K^)}, A: = 0,1. 
Combining (30l-(32i, taking ^2 '■= k\m^ /I, we obtain 

P{{n^l'^\Bn\ >/t2(l+0}na„) <2exp{-fV(2^^)}- 



(34) 



(35) 



whereas Cauchy-Schwarz's inequality implies 

^(fi'lnj <£||Z;-^';||ilr2„ <4m-4£{(||p„_p||p^ + ||P„_P||pj2|j^^_ < Um-^^^-i. 

Combining Lemma[T]and (35 1, there exists nonnegative (depending only on m and C,) such that 



PW^\Bn -E{Bn)\ > ^3(1 +0} < 8exp{-fV(2/:2)} +p(ii^). 



(36) 
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To bound P{Q.'^j), recall that e^^\t) > m by assumption. Consequently, 

C > m/2 for some t} C {||p„ -P||p^ > m/2}. 

By ( [33] ), we have SdlP^ — P||fo) < "^/4 eventually. By another application of the concentration the- 



orem ( [Massart 2000 1, there exists finite k4 so that -P||fo > m/4(l < k4exp{-nt^ /2). 

Setting f = 1, 

P(a^) <P(||P„-P||fo >"^/2) <A:4exp(-«/2). 

Substituting this bound in (|36]) and combining with pQ] ), omitting now the assumption that Zjj is 
bounded, it follows that there exists constants Ci,C2,C3 > such that for any K > and t > 0, 

P{n^^^\dj-5j\ >Ci(l+0} < 10exp{-?V(2i<:^)} + C2exp(-C3?i)+/'f max |Z,-^| >k). 

The statement of the lemma then follows from the union bound. □ 

Lemma 3. Suppose that Assumptions A-B hold and that there exists positive constants lo,h,ri 
such that P{\Zij\ > s) < loe^p{—hs^) for sufficiently large s. If K" < 1/2 then for any > 
there exists k2> such that 

P( max \dj - 5j\ > kin-^) < 0[p„exp{-A:2«^^"^''^''/^''+^^}]. (37) 

Suppose in addition that \5j\ > kjn"'^ whenever j G Mg and that = k/^rT^ where k'i,ki\ are 
positive constants and k^ <ki,/2. Then 

P(M'J C m;^) > 1 -0[/7„exp{-/t2?i(i"2'')''/(''+2^}]. (38) 
In particular, if log;?„ = o{„(i-2>c)t?/(j?+2)} then P(M^ C M^^) ^ 1 when n^oo. 

Proof. In Lemma|2] take \+t = kin^l'^^^/C\ and K := Then there exists positive 

constants ^2 , h such that for each j = l,...,pn, 

Pildj-djlykin-") < \Oexp{-hn^^-^''^'^^^'^+^^+nloexp{-hn^^-^''^'^/^'^+^^}. 
By the union bound, there exists ^2 > such that 

P(^max Idj-djlykin-"^ < 0[79„exp{-A:2«^i"2rc)7,/(r,+2)|]. 



which proves (37 1. Concerning (38 1, kT,n — \dj\ < \5j — dj\ by assumption and so 



pI min \dj\ <%] < Pi max IJ,- 5/1 > k4n '^-Yn] < Pi max \dj-5j\ >n %/2 ); 
V jeW- ■' J \ ;gM" ' ■' ' ) \ jeu", ■' J 



where the last inequality follows since we assume k^ <k^l2. Taking k\ =kT,/2m. ([37]), we arrive at 
the desired conclusion: 

P{Ml C M2) > 1 -/'( min \dj\ < 7,,) > 1 - 0[pn&xp{-k2n^^-^''^'^/^'^+^^}]. 
Finally, P{M"g C M^) 1 when n^oo follows immediately when logp„ = o{n^^^^'^'>'^/^^^^^. □ 
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Lemma 4. Let Z G M'^ be a random vector with zero mean and covariance matrix E. Let b G 
and suppose that E{Z,\ZJ\i) = cZ^b for some constant vector c G W\ Assume that / is some 
real function. Then 

£{Z/(Z b)} - Eb^^-^-^, (39) 

taking 0/0 := 0. If moreover / is differentiable and strictly monotonic, there exists £ > such 
that 

£|Z/(ZTb)| > Eb£/Var(ZTb). (40) 
In particular, E{Zjf{Z^\i)] = iff Cov{Zj,Z^\i) = 0. 

Proof. Set W := Z^b. By standard properties of conditional expectations, it holds that 
= E{W{Z-E{Z\W))] = Z\i-E{WE{Z\W)]=Z\i-cE{W^), 



implying E{Z\W) = EbW/Var(lV). We then obtain (|39]): 



E{Lf{Z^h)}=E{E{Z\W)f{W)} = 'LhE{Wf{W)}/Yav{W). 



To show ( [401 ), the mean value theorem implies the existence of some Q <W <W such that 
E{Wf{W))=E[W{m+f'{W)W]=E{W^f'{W)}. 

Then 

E\W^f\W)\>E{\f'{W)\W^\{W^<\)]> inf \f' {x)\E{W^\{W^ <\)}. 

0<x<l 

Strict monotonicity of / then yields (HOl). □ 



Lemma 5. Assume that the survival time T has a general, continuous hazard rate function 
XT{t\Z) depending on the random variable Z G M and that the censoring time C is independent 
of Z, T . Then 

5=1 e{t)dF{t)=E{e{TACAr)}; 
Jo 

whereF(0 ■=P{TACAr<t) and e{t) := E{ZP{T >t\Z)}/E{P{T >t)}. 

Proof. Let St , Sc denote the survival functions of T, C, conditionally on Z. Using the expression 
([8]l for 5 alongside the assumption of random censoring, we obtain 



5=E 







{Z-e{t)}Y{t)?iTit\Z)dt 



(41) 



= J^\c{t)E{ZSTmTm}<it- l'^^p^Sc{t)E{Yit)XTm}<it (42) 

= - / -eit)E{Y{t)}dt; (43) 
JO at 

where last equality follows since S'j = —XtSj. Integrating by parts, we obtain the statement of the 
lemma: 

8 = - r ^e{t)E{Y{t)]At = - f ^e{t)E{P{TACAT>t\Z)}dt = E{e{TACAT)}. 
Jo at Jo at 

□ 
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Proof of Theorem\l\ Setej{t) := E{ZijST{t,Zj a^)}/E{ST{t,Zj a^)} with5r(f, ■) =exp{- JqX{s, 
From Assumptions [T]j3} Assumption C, and Lemma |4} there exists a universal positive constant 
such that 

\5j\ = \ej{t)\ > \E{ZijST{t,Zja^)}\ > ki\Cov{Zij,Zja% j GM". 
Then M" C M'g. The sure screening property follows from Lemma |3] and the assumptions. 
Proof of Theorem |2] Suppose that 

= 0{A™ax(E)}. 

For any £ > 0, on the set B„ := {^^^\<j<p„ \dj — 8j\ < en^^}, it then holds that 

|{1 < J < A, : \dj\ > 2en-^]\ < |{1 < j < : > en-^]\ < 0{n^''K..{'^)}- 

Taking ki = 2e in Lemma [3j we have 

P[\M",\ < 0WX^,,iL)}]>P[\{j : \dj\ > krn-''}\ < 0{n^''K..im > PiBn). 



□ 



(44) 



By Lemma|3| P{B„) = 1 - 6>[/7„exp{-C3?i(^"^'^)^/('i+^)}] as claimed. So we need only verify (|44]). 

By Lemmajsj there exists a positive constant ci such that \dj\ < ci Jq \E {Zi jSt {t, ZJ a'^)}\dF{t) 
for j € M" with F the unconditional distribution function of Ti A Ci A T. In contrast, 5j = for 
j ^ M", by Assumptions 3]|4 It follows from Jensen's inequality that there exists a positive constant 
C2 such that 

\\Sf < C2 f ||£{Zi5r(f,z7a°)}f dF(0. (45) 
JO 

Lemma |4]implies 

(46) 



Var(ZTaO) 

By Cauchy-Schwartz's inequality, using that \\La^f < ||E'/^a°f < A„,ax(2)P^/^a°f , 

\\E{ZiSTit,Zja'')}f < ||Ea«||VVar(z7a°) < A„,ax(E). 



Inserting this in ( |45) then yields the desired result ( |44| ). Note that this result does not rely on the 
uniform boundedness of Yar{Zj a'^) (Assumption C). □ 

Lemma 6. Suppose that Assumption A holds and that both the survival time T\ and censoring 
time Ci follow a nonparametric Aalen model ( [TT] ) with time-varying parameters a" and , 

1 /9 ~ ~ 

respectively. Suppose moreover that Zi = E^/^Zi where Z\ has i.i.d. components and denote 
by := E{e\p{Zj\x)} the moment generating function of Zj\. Then 



\E{Yy{t)]a\tydt\i:^l^; (47) 
t)) J 



.Ax (/)(x) 

where T^{t) := E^/2/o{a*'(5) + ^^{s)]As. In particular, if Zj ~ N(0,E) then 

8 = i:{j\\t)E{Y,{t)}At}. 



(48) 
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Proof. Let Aj and Ac denote the cumulative baseline hazard functions associated with Ti and Ci. 
Combining ([8]) and (]_J_i, we get 

5 =£|^'ziZ7yi(0a°(0df} -^'£{Zi7i(f)}®^£{yi(0}"'a''(0df (49) 

= r'L^I^'R{t)i:^I^E{Yi{t)]a\t)dt; (50) 
Jo 

defining here 

£{Fi(0}£{ZiZ7yi(0}-£{Ziyi(0}^2 
^^^"^^^^^^ \ 

Since we have Yi{t) = exp[— {Ar(f) +Ac(f) +Z]'^ conditionally on Zi, independence of the 

components of Zi clearly implies [H(f)],y = for / ^ j. For / = j, factor the conditional at-risk 
indicator as Yi{t) = F/^'(f)F/ ^\t) where Y^^'^ := exp{— Ziyr^(f)}. Utilizing independence again, 
we get 

E{Yi^\t)}E{Z}jYi^\t)}-E{Yl^\t)Zyjy d f (x) 



E{Yi^\t)y ^ <^'W ^=-r1W 

This proves ( [47] ). To verify ( [48] ), simply note that the moment generating function of a standard 

Gaussian is ^{x) = exp(;<;^/2) for which d/dx{^'{x)^{x)^^) = 1. □ 

From ( |47| ), a 'simple' description of 5 (which does not involve factorizing a matrix in terms of Z'^^) 
is available exactly when features are Gaussian. Specifically, it holds for some fixed K > that 

iff ^{x) = exp(Kx^ /2), the moment generating function of a centered Gaussian random variable. 

Proof of Theorem^ We apply Lemma [6] Denote by vy the j'th canonical basis vector in RP". 
Integrating by parts in ( [48] ), we obtain 

5j = \]L 1^ a^{t)E{Yi{t)}dt = \]L I a°{t)E{P{Ti ACi AT >t)}dt = \JZE{A^{Ti ACi at)}. 

By the assumptions, |vT££{A"(ri ACi At)}| > ci?i"'^ whenever ; € M". Thus M" C M'^. For 
Gaussian Zij, we have Pi\Zij \>s)< exp ( —s^/2) , and the SIS property then follows from Lemmajs] 

□ 



Proof of Theorem [?] Recall that 

A = E 

Then 



/'{Zi-e(0}®'Fi(Od/ 

JO 



o_ r E{Yi it)}E{Yi (0ZiZ7«»} - £{71 it)Zja^}E{Yr (QZi } 
~Jo E{Y^it)} 
But by Lemma|4]and the assumption of random censoring, 

£{Fl(0ZlZ7«"} = S«" ^^^^^"T^\^'^^ and£{ZiFi(0}=Ea"^^^W??^- 
^ ^ ^ Var(ZTaO) I 1 u Var(ZTaO) 

So we can construct a function ^ such that Aa° = £a°/J'i§(Z7a°,f)df where Jq ^{Zj a^\t)dt / 
0, by nonsingularity of A. Similarly, using Lemma [5| we may construct a function such that 
5 =Ea"/o'C(Z7a",Odf. Taking v := i;{Zja^,t)dt/ ^{t,Zja)dt, i3° = va" solves A)5° = 
5. □ 
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